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ABSTRACT 

The  present  technical  report  describes  current  progress  made  in  designing  an  Intelligent  Support  Systan 
(ISS)  for  automatic  data  analysis  (i.e.,  knowledge  acquisition)  and  efficient  data  exploitation  in  numerical 
scientific  databases,  with  an  application  to  the  CAST-NEONS  environmental  databases  used  for  ocean 
modeling  and  prediction.  First,  we  explain  the  improvements  made  to  the  Automatic  Data  Analysis  System 
(ADAS)  by  combining  the  features  of  two  machine  learning  techniques  (i.e.,  data  clustering  and  inductive 
iearning  by  decision  tree)  to  generate  sets  of  production  rules  that  eSciently  describe  the  observational 
raw  data  contained  in  the  scientific  databases.  Data  clustering  allows  the  system  to  classify  the  raw  data 
into  clusters,  which  the  system  learns  by  induction  to  build  the  decision  trees,  &om  which  are  deduced 
the  production  rules.  Second,  we  describe  the  design  of  a  robust  computational  model  of  conversation 
(ENTRETIEN)  for  interactive  nsturai  language  man-machine  interfacing,  based  on  case  grammar  theory. 
Such  a  system  allows  the  user  to  naturally  communicate  with  the  system  for  efficient  and  effective  information 
retrieval.  ENTRETIEN  conducts  the  conversation  to  recognise  the  user’s  intentions  (i.e.,  user’s  plans).  To 
include  the  automatically  generated  production  rules  into  the  knowledge  base  of  an  expert  system  and  to 
add  the  facilities  of  the  natural  language  interface  represent  our  Snal  goal  in  building  an  intelligent  support 
system  for  efficient  exploitation  of  large  scientific  databases. 


1  Needs  for  an  Intelligent  Support  System 

Scientific  databases  constantly  and  rapidly  grow  in  terms  of  the  amount  of  gathered  knowledge.  This  huge 
flood  of  raw  knowledge  is  mainly  numerical  data  representing  observations  (e.g.,  satellite  infrared  images, 
ship  and  storm  tracks,  etc.).  It  is  predicted  to  be  of  the  order  of  gigabit  or  terabit  of  new  raw  data  per  day. 
These  fast  growing  scientific  databases  render  impossible,  even  unfeasible,  human  raw  data  analyu;  hence 
the  urgent  need  for  automatic  data  analysis  [PSF91]  and  efficient  human-machine  interface. 

Our  research  project  responds  then  to  the  need  to  find  knowledge  in  the  flood  of  data  and  to  efficiently 
allow  the  user  to  access  and  use  this  knowledge.  It  is  then  concerned  in  filling  the  gap  between  data  generation 
and  data  understwding  for  efficient  and  effective  data  exploitation,  such  as  retrieval  of  information  and 
inferences  on  current  knowledge.  Our  approach  is  for  the  system  to  discover,  in  an  automatic  manner, 
relationships  among  the  objects  contained,  or  to  be  inserted,  in  the  databases.  This  corresponds  to  finding 
interesting  and  useful  patterns  in  the  raw  observational  data  stored  in  the  databases.  Such  patterns  represent 
an  effective  and  efficient  description  of  the  data.  IVansforming  those  patterns  into  production  rules  to  be 
inserted  in  the  knowledge  base  of  an  expert  system,  and  adding  the  facilities  and  power  of  an  interactive 
conversational  natural  language  interface,  will  allow  the  system  to  do  efficient  exploitation  of  large  scientific 
databases.  Designing  and  implementing  such  an  Intelligent  Support  System  (ISS)  is  our  final  goal  in  this 
research.  The  overall  ISS  functional  components  are  depicted  in  Figure  1. 


Figure  1;  ISS  functional  components 

The  second  section  of  this  report  is  concerned  with  the  discovery-learning  process  of  ISS.  It  describes 
ADAS,  an  Automatic  Data  Analysis  System,  which  combines  the  strengths  of  two  machine  learning  tech¬ 
niques,  that  are  data  clustering  and  induction  learning  by  decision  trees.  The  intent  is  to  generate  production 
rules  that  describe  the  raw  data  contained  or  being  included  in  the  scientific  databases.  It  continues  and 
enhances  previous  work  in  which  a  system  for  knowledge  discovery  in  scientific  databases  was  designed  and 
implemented  using  data  clustering,  classification  heuristics,  and  domain  knowledge  [PP94].  This  previous 
system  analyzed  the  raw  data  in  order  to  find  structure  in  the  data.  This  was  done  by  partitioning  the 
objects  space  into  a  hierarchy  of  clusters,  each  representing  a  class  of  objects  that  are  considered  to  describe 
a  similar  concept.  The  clustering  was  performed  using  some  pre-defined  classification  criteria  (such  as  object 
features),  some  heuristics,  and  some  domain  knowledge.  The  present  research  adds  to  this  basic  system  a 
set  of  adgorithms  to  make  a  general  description  of  each  cluster.  Such  descriptions  are  expressed  in  first  order 
predicate  logic  as  production  rules.  These  production  rules  are  intended  to  be  used  by  an  expert  system 
for  efficient  exploitation  of  the  knowledge  contained  in  scientific  databases  (e.g.,  inference  and  information 
retrieval).  These  algorithms  are  based  on  inductive  learning  by  decision  trees  and  automatic  rule  genera- 
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tion  and  were  implemented  by  Quinlan  in  the  C4.5  package  [Qui93].  The  remainder  of  the  second  section 
presents  the  results  of  experiments  conducted  in  applying  ADAS  to  CAST-NEONS,  a  real  world  scientific 
database  currently  used  by  the  Center  for  Air  Sea  Technology  (CAST)  for  ocean  modeling  and  prediction. 
By  these  experiments,  it  is  shown  that  the  system  was  able  to  identify  automatically,  from  scratch,  the  Gulf 
Stream  in  the  North  Atlantic  area  by  analyzing  and  classifying  measures  of  its  features,  such  as  salinity  and 
temperature.  This  was  done  at  constant  depth  but  we  intend  to  enable  the  system  to  search  more  interesting 
correlations  such  as  functions  of  salinity,  temperature,  depth,  and  temporal  variables.  By  these  examples, 
we  empirically  demonstrate  the  feasibility  of  our  approach  in  doing  automatic  numerical  data  analysis. 

The  third  section  discusses  the  effective  exploitation  of  the  knowledge  by  the  user.  Discoveries  are  appro¬ 
priately  formatted  by  the  system  for  the  intended  user.  In  the  first  stage,  this  is  done  as  first  order  predicate 
logic  rules  obtained  from  decision  trees.  This  is  a  convenient  knowledge  representation  for  knowledge  ma¬ 
nipulation  and  analysis,  and  is  easily  transformable  into  natural  language  for  human  user  data  exploitation. 
This  section  discusses  the  design  of  an  interactive  and  robust  natural  language  conversationsd  simulation 
model  we  named  ENTRETIEN.  Since  language  is  a  natural  support  for  human  being  communication  for 
transfer  of  information,  ENTRETIEN  will  allow  the  human  user  to  naturally  communicate  with  the  system 
to  efficiently  exploit  the  potential  information  non-trivially  stored  in  the  large  scientific  databases.  This  is  to 
palliate  the  actual  obvious  lack  of  efficient  and  effective  communication  between  human  users  seeking  infor¬ 
mation  and  computer  systems  willing  to  provide  such  knowledge.  ENTRETIEN  will  conduct  the  dialogue 
with  the  user  by  analyzing  each  user’s  utterance  (e.g.,  query,  command,  etc.)  expressed  in  a  natural  way 
(e.g.,  English)  and  in  the  current  context  of  the  conversation  to  better  serve  the  user’s  needs.  This  is  based  on 
human  natural  language  information  seeking  dialogues  during  which  each  participant  attempts  to  recognize 
the  other  participant  intentions  or  plans  for  successfully  communicating  information.  Our  system  will  thus 
interactively  conduct  the  dialogue  to  recognize  the  user’s  intentions  in  seeking  information  in  the  scientific 
numerical  databases.  We  want  to  base  our  plan  recognition  simulation  model  on  a  previous  model  developed 
by  Carberry  [Car90].  This  model  seems  to  be  the  most  appropriate  current  computational  simulation  model 
for  further  enhancements  towards  a  more  robust  system  that  will  be  a  step  closer  to  the  human  way  of 
handling  information  seeking  dialogues.  Such  model  of  conversation  is  based  on  standard  natural  language 
processing  models  that  simulates  human  cognitive  processes  in  understanding  wd  generating  utterances  in 
natural  language  such  as  English.  Our  natural  language  system  (NLS)  is  based  on  case  grammar  [Fil68] 
with  case  frames  in  a  semantic  conceptual  network  paradigm.  Such  NLS  is  enhanced  with  other  cognitive 
processes  as  well  as  pragmatics  and  planning  strategies  to  efficiently  determine  the  meaning  of  each  current 
utterance  in  the  context  of  the  actual  dialogue  and  to  recognize  the  plams  of  the  user  dialogue  participant. 

In  the  last  sections,  this  report  presents  future  directions  toward  the  goal  of  designing  and  implementing 
the  intelligent  support  system.  This  includes  the  design  of  the  knowledge  base  updating  system,  the  inte¬ 
gration  of  the  productions  rules  and  a  previous  semantic  network  that  captures  the  domain  knowledge  into 
an  expert  system  knowledge  base,  the  implementation  of  ENTRETIEN  for  the  system  to  naturally  converse 
with  human  users,  the  selection  of  better  data  clustering  algorithms  to  refine  auid  render  more  robust  the 
knowledge  discovery /learning  process,  and  finally  to  encode  the  knowledge  found  into  a  generic  form  suitable 
for  the  knowledge  base. 
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2  Discovery-Learning  Process 

This  section  describes  the  discovery  and  learning  processes  of  the  Intelligent  Support  System  (ISS)  im¬ 
plemented  in  the  Automatic  Data  Analysis  System  (ADAS).  Scientific  databases  mainly  contain  raw  data 
corresponding  to  observations  of  some  parameters.  Knowledge  discovery  in  such  databases  is  done  by  anal3rs- 
ing  and  classifying  observations  into  clusters  to  obtain  "implicit,  previously  unknown,  and  potentially  useful” 
new  information  [PSF91].  New  information,  or  patterns,  are  interesting  and  certain  enough  descriptions  of 
the  relationships  among  clusters  of  observational  data  as  well  as  descriptions  of  the  main  characteristics  of 
each  cluster.  To  determine  whether  a  pattern  is  interesting  is  based  on  criteria  imposed  by  domain  and 
background  knowledge.  In  addition,  it  has  to  be  novel,  useful  regarding  the  domain  of  application,  and  the 
discovery  process  must  be  non-trivial  (i.e.,  the  system  takes  autonomous  decision  on  how  to  process  the 
data  and  evaluate  the  results).  Piatetsky-Shapiro  and  Frawley  added  that  knowledge  is  considered  useful 
if  it  facilitates  the  system,  or  the  user,  to  satisfy  a  given  goal  [PSF91];  For  example,  identifying  the  Gulf 
Stream  features  in  the  North  Atlantic  to  infer  predictions  (see  application  examples  in  Section  2.5).  Patterns 
ue  described  using  first  order  predicate  logic  (i.e.,  production  rules)  because  it  is  well  known  that  it  is  a 
deterministic  manner  to  encode  knowledge.  It  can  also  be  easily  transformed  into  a  high  level  language,  such 
as  a  subset  of  English,  to  be  understood  by  human  users,  which  is  the  main  goal  in  discovering  knowledge. 
Extraction  of  knowledge  by  ADAS  from  non-trivially  stored  raw  observationsd  data  is  mainly  done  in  two 
steps: 

•  First,  the  system  must  identify  interesting  patterns  underlined  in  the  raw  data.  This  is  called  the 
identification  process  [PSF91],  during  which  ADAS  clusters  the  instance  space  into  subclasses  that 
reflect  regularities  inherent  in  the  observational  data.  Our  approach  is  to  use  mriltivariate  data  analysis 
and  unsupervised  machine  learning  techniques,  such  as  data  clustering.  This  phase  follows  a  prior  study 
[PP94]  in  which  we  used  mainly  statistical  data  analysis  (e.g.,  data  clustering  techniques)  to  discover 
hidden  knowledge,  or  useful  regularities  in  the  data  (see  also  [ZB91}  and  (PSF91]).  The  previous 
classification  used  the  Euclidean  distance  in  a  N-space  among  objects’  features  to  cluster  similar  objects 
into  a  hierarchy  of  classes.  The  search  in  what  to  look  for  is  guided  by  heuristics,  domain  knowledge, 
and  the  user’s  current  interests  (e.g..  Gulf  Stream  features,  as  shown  in  the  experiments). 

•  Second,  the  system  must  describe  these  clusters  into  a  concise  and  meaningful  representation  for  the 
intended  user,  such  as  first  order  predicate  logic  production  rules  for  use  by  an  expert  system  shell  or 
natural  language  sentences  for  use  by  a  human  user.  This  phase  is  called  the  descriptive  process 
[PSF91],  and  it  summarizes  the  relevant  features  that  describe  each  cluster.  This  is  related  to  supervised 
machine  learning  techniques. 

One  of  the  main  issues  of  developing  ADAS  is  to  use  computationally  efficient  algorithms  to  deal  with 
large  volumes  of  data  and  to  use  some  domain  knowledge  and  heuristics  to  guide  the  search  for  hidden 
information. 

The  Automatic  Data  Analysis  System  is  composed  of  three  main  components  as  depicted  by  Figure  2. 
The  dataset  of  training  instances  (a  selection  of  raw  observational  data),  which  forms  the  N-dimensional 
data  space,  is  first  statistically  analyzed  or  searched  for  structure  (identification  process).  This  is  done  by 
data  clustering.  The  data  space  is  partitioned  into  clusters  (or  classes)  such  that  there  is  both  maximal 
dissimilarity  between  clusters  and  maximal  similarity  between  instances  in  each  cluster.  Since  these  two 
characteristics  are  contradictory,  trade-offs  must  be  found.  These  clusters  become  classes  of  the  instances 
for  the  supervised  learning  phase  (inductive  learning).  In  the  inductive  learning  process,  each  cluster  is 
described  by  a  decision  tree  (descriptive  process),  from  which  production  rules  are  generated.  Each  of  these 
components  is  now  described. 

Following  the  requirements  to  design  systems  for  knowledge  discovery  in  databases  that  were  suggested 
during  the  workshop  on  Knowledge  Discovery  in  Databases  at  the  Eleventh  International  Joint  Conference 
on  Artificial  Intelligence  [PSF91],  the  ADAS  includes  the  four  following  characteristics: 

1.  High-level  language.  To  be  considered  knowledge,  findings  must  be  understood  by  both  machine 
and  human,  hence  the  use  of  production  rules  that  can  be  transformed  into  well  understood  English 
sentences  (see  the  design  of  ENTRETIEN  in  Section  3). 
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Figure  2:  ADAS’  functional  components 


2.  Accuracy.  Certainty  factors  are  attached  to  data  to  ensure  the  reliability  in  the  findings. 

3.  Interesting  patterns.  Findings  must  be  evaluated  ’’interesting”  to  be  kept  in  the  knowledge  base  as 
new  knowledge  of  interest  for  the  application  domain.  This  is  decided  according  to  a  combination  of 
heuristics  and  domain  knowledge. 

4.  Efficiency.  The  discovery  process  should  be  predictable  and  computationally  acceptable  when  ana¬ 
lyzing  large  amount  of  data. 

A  discussion  of  those  characteristics  is  considered  throughout  the  following  subsections,  but  we  first  de¬ 
scribe  the  functional  components  that  compose  ADAS  as  well  as  the  strategies  used  to  analyze  the  data. 
Then,  we  discuss  a  simple  approach  to  handle  uncertainty  in  the  discoveries,  and  finally,  we  present  experi¬ 
ments  conducted  on  gridded  data  from  the  CAST-NEONS  scientific  databases. 

2.1  Data  clustering:  finding  structure  in  the  data. 

This  section  briefly  summarizes  previous  work  done  by  Perrin  and  Petry  [PP94].  Multivariate  data  analysis, 
also  called  data  clustering,  is  intended  to  analyze  the  statistical  behavior  of  the  objects  of  a  given  set  to  get 
complete  information  on  them.  It  facilitates  greatly  human  comprehension  of  raw  observational  data.  It  is 
believed  that  with  enormous  volumes  of  data,  one  can  rely  on  statistical  analysis  methods  to  describe  them 
[Gre85]. 


2.1.1  General  approach. 

Our  approach  to  describe  the  set  of  raw  observational  data  uses  data  clustering  to  find  structure  (functional 
relationships)  in  the  data.  Such  structure  is  used  to  hierarchically  classify  the  instances  (see  Figure  3).  From 
this  hierarchy  are  derived  a  number  of  clusters,  each  contuning  objects  that  describe  the  same  concept, 
according  to  some  criteria,  such  as  object  features.  Some  domain  knowledge  and  heuristics  are  used  during 
the  data  clustering  process  to  avoid  finding  unnecessary  relationships  between  every  pair  of  objects  and  to 
keep  high  computational  efficiency.  This  is  justified  by  the  fact  that  some  relations  are  known  not  to  be 
of  interest  for  the  particular  application,  and  thus  the  system  saves  computational  effort  and  time  by  not 
processing  them.  This  is  also  to  palliate  the  problem  of  deciding  which  features  are  to  be  taken  in  account  for 
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clustering  [PP94].  Similarity  between  instances  is  done  by  computing  the  Euclidean  distance  between  every 
pair  of  attributes.  Note  that  this  resemblance  coefficient  (i.e.,  Euclidean  distance)  handles  missing  values  in 
the  data  matrix.  The  larger  the  degree  of  similarity  (i.e.,  distance)  is  between  two  instances,  the  less  similar 
these  instances  are.  The  obtained  resemblance  data  matrix,  which  groups  similarity  measures  between  every 
pair  of  instances,  is  applied  to  a  clustering  algorithm  that  builds  a  hierarchical  tree  of  the  instances.  This 
tree  is  then  analyzed  and  interpreted  to  decide  where  to  cut  it;  or  to  decide  how  many  clusters  have  to  be 
considered  (see  Section  4.1). 


of  instances 
(raw  observation^  data) 


Classes  of  instances 
describing  the  data 


Figure  3:  Data  clustering  process 


2.1.2  Implementation. 

Multivariate  data  analysis  and  data  clustering.  We  used  the  Cluster  2.7  package  designed  by  Yoshiro 
Miyata  and  Andreas  Stolcke  at  Berkeley  University,  and  freely  distributed  [MS93].  The  clustering  algorithm 
used  in  this  package  is  a  variant  of  the  Single  Linkage  Clustering  Anadysis  (SLCA).  According  to  Stolcke, 
the  main  difference  between  Cluster  2.7  and  SLCA  is  that  inter-cluster  distances  are  computed  using  the 
cluster-means  rather  than  the  minimal  distance  between  cluster  members  (see  Section  4.2). 

Determining  the  number  of  clusters.  To  use  C4.5  on  clustered  objects,  we  needed  to  come  up  with 
a  way  of  assigning  classes  to  the  clusters  produced  by  Cluster  2.7.  The  approach  taken  was  to  restrict 
the  number  of  classes  to  a  power  of  2.  The  hierarchy  produced  by  Cluster  2.7  is  a  binary  tree  with  its 
leaves  representing  the  objects  extracted  from  the  scientific  database.  Given  the  binary  tree  structure 
produced  by  Cluster  2.7,  the  class  assignment  algorithm  can  easily  traverse  the  tree  in  an  in-order  method 
2md  easily  add  objects  to  a  class  by  simply  determining  its  current  level.  If  the  current  level  is  greater  than 
log2(num6er^/_c/asses)  then  it  would  add  the  object  to  the  current  class.  If  the  algorithm  moves  to  a  level 
which  is  less  than  Xog^inumberjaf  xlasaes)  and  then  traverse  back  down  then  it  would  increment  the  class 
by  1  and  start  adding  objects  to  the  new  class.  The  algorithm  continues  in  a  recursive  manner  until  it  has 
traversed  the  entire  tree. 

Although  this  approach  is  very  fast  and  straightforward,  it  has  a  tendency  to  produce  classes  with  minimal 
amounts  of  information.  In  the  examples  mentioned  later,  it  will  be  seen  that  class  Cl  covers  only  a  small 
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region  of  the  grid  compared  to  clan  C2  or  C3.  A  better  algorithm  would  put  CO  and  Cl  in  the  nine  clan 
and  keep  C2  and  C3  aa  separate  flniwm  We  intend  to  explore  a  different  approach  to  perform  this  task  in 
future  work  (see  Section  4.1). 

2.2  Decision  trees  -  describing  the  clusters. 

2.2.1  General  approach. 

Although  the  Cluster  2.7  package  is  very  powerful  at  finding  clusters,  one  can  not  produce  a  set  of  ruin 
from  or  extract  information  about  the  clusters.  The  C4.5  machine  learning  package,  developed  by  Quinlan 
[Qui93],  will  produce  decision  tren  that  production  ruin  can  readily  be  extracted  from.  Like  Cluster 
2.7,  C4.5  performs  classification.  A  two  step  procen  using  Cluster  2.7  and  C4.5  is  needed  because  C4.5 
requim  pre-defined  and  discrete  classn  to  perform  supervised  learning.  Cluster  2.7  providn  the  clustering 
information  needed  to  derive  pre-defined  and  discrete  classn. 

C4.5  builds  decision  tren  in  a  divide  and  conquer  approach  based  on  E.B.  Hunt’s  method  for  constructing 
decision  tree  from  a  set  of  training  casn  [Qui93].  The  approach  is  simple  and  works  as  follow.  Using  a  set  T 
of  training  casn,  let  the  classn  be  denot^  ...,Cfc}.  There  are  three  ponibilitin  for  the  training 

casn: 

1.  All  of  the  training  casn  belong  to  a  single  class  Cj.  Thus,  the  decision  tree  is  a  single  leaf  identifying 
class  Cj . 

2.  The  training  set  T  contains  no  casn.  Hence,  the  decision  tree  is  just  a  leaf  but  the  class  associated 
with  the  leaf  is  determined  from  information  outside  of  the  training  casn. 

3.  T  contains  casn  belonging  to  many  classn.  In  this  case,  the  goal  is  to  head  towards  single-clan 
collections  of  casn.  A  tnt,  based  on  a  single  attribute,  is  chosen  so  that  there  are  mutually  exclusive 
outcomn  {0i,02,0z,  ■■■On}-  T  is  then  partitioned  into  subsets  Ti,T3,T3,  ■■■,Tn,  where  Tt  contains  all 
the  casn  of  T  that  have  the  outcome  Oi  of  the  tnt.  This  is  done  recursively  on  each  subset  Ti  [Qui93]. 

2.2.2  Tests  to  partition  test  cases. 

Finding  tnts  to  partition  the  tnt  casn  is  an  important  task.  The  purpose  is  not  to  simply  partition  the 
tnt  casn  but  to  help  build  a  decision  tree  that  accurately  reflects  the  domain  structure  and  the  predictive 
powers  associated  with  knowing  the  propertin  of  the  domain.  Ideally,  tnts  should  be  chosen  at  each  stage 
to  produce  a  small  final  tree  [Qui93]. 

Unfortunately,  as  Hyafil  and  Rivnt  showed,  finding  the  smallnt  decision  tree  consistent  with  a  training 
set  is  NP-complete  [Qui93].  The  method  used  in  C4.5  to  construct  decision  tren  is  a  non  backtracking, 
greedy  algorithm.  When  a  tnt  has  been  selected  to  partition  the  current  set  of  casn,  the  choice  is  set  and 
alternative  choicn  are  not  explored. 

C4.5  usn  a  gain  ratio  criterion  for  evaluating  tnts.  The  gain  ratio  criterion  is  a  modification  of  the  gain 
criterion  used  in  the  ID3  algorithm  (cf.  Appendix  7.1).  Quinlan  found  that  the  gain  criterion  had  a  strong 
bias  which  favored  tnts  with  many  outcomn.  Quinlan  usn  the  hypothetical  situation  of  a  medicd  diagnosis 
task  where  one  of  the  attributn  happens  to  contain  a  unique  patient  identification.  Partitioning  the  training 
casn  based  on  this  attribute  will  produce  a  large  number  of  subsets  containing  one  case.  The  gain  ratio 
criterion  approach  fixn  this  bias  by  normalizing  the  gain  attributed  to  tnts  with  many  outcomn.  This  is 
done  using  the  equation  spUtJnfo^X)  =  x  log]  which  represents  the  potential  information 

generated  by  dividing  T  into  n  subsets,  while  the  information  gain  measum  the  information  relevant  to 
classification  that  arisn  from  the  same  division  [Qui93}.  Now,  gainjratio(X)  =  the 

proportion  of  information  that  is  helpful  for  classification.  To  avoid  unstable  ratios  produced  by  trivial  splits, 
the  gain  ratio  criterion  choosn  a  tnt  that  will  maximize  this  ratio  with  the  constraint  that  the  information 
gain  is  large  [Qui93]. 

In  the  medical  diagnosis  example  previously  mentioned,  Quinlan  statn  that  the  patient  identification 
attribute  will  not  be  ranked  highly  using  the  gain  ratio  criterion. 
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2.3  Generation  of  production  rules. 

The  process  of  generating  production  rules  from  a  decision  tree  is  a  straightforward  process.  Each  path  in 
the  decision  tree  corresponds  to  an  initial  rule.  The  left  hand  side  of  the  rule  is  the  set  of  conditions  that 
follow  the  path  along  the  tree  to  the  leaf.  The  right  hand  side  of  the  rule  represents  the  leaf.  Conditions 
are  removed  from  each  rule  that  do  not  help  distinguish  the  selected  class  from  the  other  classes  using  a 
pessimistic  estimate  of  rule  accuracy.  The  next  step  is  to  delete  simplified  rules  that  do  not  contribute  to 
the  accuracy  of  the  set  of  rules.  After  this  step,  the  simplified  rules  are  ordered  to  minimize  false  positive 
errors.  A  default  class  is  chosen  also  during  this  last  step  [Qui93]. 

2.4  Uncertainty  in  the  findings. 

Certainty  is  a  measure  of  the  degree  of  belief  the  system  has  in  the  findings.  This  is  an  important  issue 
in  knowledge  discovery  that  must  be  considered  to  assess  the  credibility  and  the  quality  of  the  findings,  as 
Piatetsky-Shapiro  and  Frawley  noted  "without  sufficient  certainty,  patterns  become  unjustified  and,  thus, 
faul  to  be  knowledge”  [PSF91].  Each  object  and  discovery  in  ADAS  is  associated  with  a  computed  certainty 
factor  (CF),  corresponding  to  the  certainty  in  the  data.  It  measures  the  integrity  of  the  data  objects  used  for 
the  discovery.  Data  in  a  database  can  be  missing,  noisy,  or  inapplicable,  and  this  influences  on  the  reliability 
of  the  findings.  The  certainty  factor  approach  is  justified  by  the  fact  that  it  is  well  suited  when  the  knowledge 
is  in  rule  format,  and  it  is  a  simple  computational  model  efficient  enough  for  the  CAST  scientific  database 
[GD93]. 


2.4.1  General  approach. 

From  the  database  data  dictionary,  a  specific  feature  of  an  object  is  expected  to  have  a  certain  content  (e.g., 
vedue,  etc.).  When  such  a  content  is  controversial,  this  has  to  be  reflected  on  the  certainty  of  the  whole 
object.  First,  our  approach  is  to  consider  the  object  to  be  fully  reliable  (i.e.,  certainty  CF=1.0).  Second,  this 
certainty  is  modified  when  analyzing  each  characteristic  of  the  object  to  determine  its  validity  (e.g.,  missing 
fields,  improper  value,  etc.). 

Data  in  CAST-NEONS  database  are  mainly  observations.  There  is  no  background  knowledge  (i.e., 
domain  knowledge)  that  can  be  used  to  compute  the  confidence  factor  in  the  object.  Hence,  we  use  the 
following  simple  approach.  The  object  data  certainty  is  computed  from  eiwi  of  its  N  feature  certainties  (CFi). 
Features  ue  composed  of  main  features  and  sub-features,  with  the  sub-feature  certainties  contributing  to 
their  parent  feature.  For  example,  a  grid  object  has  8  mrun  features,  that  are  model,  geometry,  version,  base 
etm,  parameter,  level  type,  level  value,  and  grid  values.  The  main  feature  grid  values  are  a  series  of  grid 
measures  (each  being  a  sub-feature).  The  contribution  (Cj)  of  each  of  the  N  features  is  C,  =  The  data 
certainty  is  CF  =  where  -1  <  CF  <  1.  Note  that  CF=-1  means  a  maximum  disbelief  on  the  data, 

CF=:0  means  that  it  is  not  possible  to  decide  whether  data  is  reliable  or  not.  and  CF=1  means  a  maximum 
belief  on  the  data.  CFi  is  the  certainty  of  feature  JVi  smd  is  determined  as  follows: 

CFi  =  10  (maximum  belief)  if  the  feature  content  is  what  is  expected  (valid  type,  range,  etc.).  This  is 
determined  by  checking  the  background  knowledge  (e.g.,  database  data  dictionary)  that  contains  the 
description  and  constraints  on  each  attribute. 

CFi  =  0.0  (cannot  decide)  if  content  is  null  (case  of  missing  value).  There  is  no  evidence  to  decide  whether 
there  is  belief  or  disbelief  of  this  feature. 

CFi  =  -1.0  (maximum  disbelief)  if  content  is  not  what  is  expected  (case  of  inappropriate  or  noisy  data). 
Note  that  when  feature  Ni  is  divided  into  M  sub-features,  then  CFi  =  *  zJ>=i 

2.4.2  Example. 

The  following  example  illustrates  how  to  compute  the  certainty  CF  in  the  data.  Let  a  grid  object  be  defined 
by  N=:8  main  features,  feature  8  being  defined  by  M=15900  sub-features  (each  is  a  measure  at  a  given 
longitude  and  latitude  of  the  grid).  The  following  table  defines  each  feature: 
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fT 

Feature 

Content 

CTi 

dyn-6lm6 

valid  content 

1.0 

1/8 

geometry 

xys48 

not  valid  (unknown  to 
background  knowledge) 

-1.0 

-1/8 

version 

null 

missing  value 

0.0 

base  etm 

valid  content 

1.0 

1/8 

5 

parameter 

salinity 

valid  content 

1.0 

1/8 

6 

level  type 

sgma-lvl 

valid  content 

1.0 

1/8 

7 

level  value 

6. 

valid  content 

l.O 

1/8 

8 

grid  values 

seri<M  of 

15000  values 

suppose  there  are  2 
values  out  of  range 
(noisy  data) 

14998/15000 
=  0.9998666 

0.124 

Hence,  the  overall  certainty  on  the  data  of  this  grid  object  is  CF=0.624,  and  it  represents  the  degree  of 
reliability  the  system  has  on  this  grid  object. 

2.5  Application  to  CAST/NEONS  databases  -  Examples. 

2,5.1  CAST-NEONS  databases  and  grid  data. 

The  Center  for  Air  Sea  Technology  (CAST)  manages  large  scientific  environmental  databases  called  CAST- 
NEONS.  The  implementation  is  bas^  on  the  Empress  Relational  Database  Management  System  (Empress 
RDBMS)  ^  interfaced  with  the  Naval  Environmental  Operational  Nowcast  System  (NEONS)  *.  NEONS 
eases  the  management  of  the  Empress  RDBMS,  allows  one  to  query  the  databases  via  embedded-SQL  C 
functions,  and  facilitates  the  display  and  the  analysis  of  environmental  data. 

CAST-NEONS  databases  also  contain  environmental  data  from  National  Oceanic  and  Atmospheric  Ad¬ 
ministration  (NOAA)  and  Navy  archives.  It  is  used  for  research  and  evaluation  of  ocean  modeling  and  pre¬ 
diction,  as  well  as  ocean  features  studies  such  as  eddy  resolution,  currents,  surface  structure,  etc.  [NOA921. 
The  environmental  data  is  composed  of  five  generic  data  types: 

•  Image  data:  satellite  multi-band  images  and  image  overlays. 

•  Latitude-Longitude-Time  data  (LLT):  specific  measures  (from  buoys,  ships,  etc.)  of  one  or  more 
parameters  established  by  position  (latitude  and  longitude)  and  time. 

•  Line  data:  a  series  of  observational  points  along  a  (curved)  line  representing  storm  tracks,  etc. 

•  Grid  data:  values  of  measures  at  coordinates  established  by  a  grid  of  points  at  a  given  time  and  level. 
Grid  data  are  4-dimensional:  latitude,  longitude,  time,  and  parameter  measured. 

•  Volume  data:  5-dimensional  set  of  related  grid  data. 

The  present  study  will  focus  on  the  grid  data  type.  This  choice  is  a  good  trade-off  in  using  a  sufficiently 
complex  data  type  to  show  the  feasibility  of  the  study  and  to  develop  a  quick  prototype.  Grid  data  contain 
atmospheric  and  oceanographic  numerical  outputs,  user-defined  products,  and  grided  climatology.  They  are 
organized  into  model  types.  Each  grid  is  stored  in  the  databases  by  grid  field,  i.e.,  one  parameter  at  a  given 
time  for  adl  the  points  of  the  grid.  A  grid  instance  is  uniquely  identified  by  the  following  eight  fields: 

•  Grid  model  type 

•  Grid  model  version  name 

•  Grid  model  geometry  name 

^©EmpreM  Softwan,  Inc. 

^Developed  by  NOARL 
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•  Geophysical  parameter  name  measured 

•  Level  type  at  which  the  parameter  is  measured 

•  Level  value  at  which  the  parameter  is  measured 

•  Base  time  (date  and  hour)  when  the  measured  were  taken 

•  Forecast  period 

In  addition  to  these  fields,  there  is  associated  a  bit  stream  of  values  to  the  grid.  It  is  variable  in  length 
for  each  grid  and  depends  on  the  grid  model.  It  is  the  set  of  actual  measure  values  of  the  given  parameter 
at  the  given  level  for  every  position  (latitude  and  longitude)  determined  by  the  grid  model. 

2.5.2  Example  firom  CAST/NEONS:  salinity  and  temperature  in  the  Gulf  Stream  •  Generid 
approach. 

We  present  here  three  of  the  experiments  we  conducted  to  test  ADAS.  They  correspond  to  gridded  data  in 
the  Gulf  Stream  (North  Atlantic  area)  from  three  consecutive  days  (day  1:  July  15,  day  2:  July  16,  and 
day  3:  July  17  of  1991)  at  a  constant  depth  (2.5  meters  below  the  sea  surface).  The  goal  is  for  the  system 
to  automatically  determine  Gulf  Stream  features  only  by  studying  the  correlation  of  depth,  salinity,  and 
temperature,  at  given  grid  locations  (longitude,  latitude). 

To  do  so,  some  training  instances  corresponding  to  the  grid  model,  dates  and  depth  and  measurements  of 
salinity  emd  temperature  were  selected  &om  CAST-NEONS  databases  (see  Figure  4  for  the  training  instance 
object  template  definition).  Both  salinity  and  temperature  grids  were  combined,  then  divided  into  15000 
small  cells  that  represent  a  training  instance. 

/  s 

Training  matoncw _ 

depth 

salinity 

temperature 

position;  longitude,  latitude 
dote _ 

Figure  4:  Training  instance  template 

Unsupervised  learning  (i.e.,  data  clustering)  allows  one  to  determine  to  which  class  a  training  instance 
belongs  to.  Applying  the  Cluster2.7  algorithms  on  the  instance  space  produces  a  hierarchy  of  instances 
classified  by  how  close  they  are  from  each  other  in  the  space  (we  used  Euclidean  distance).  Deciding 
arbitrarily  on  the  number  of  clusters  to  keep  (four  for  the  present  three  cases),  we  were  able  to  assign  a  class 
to  each  training  instance.  Applying  the  C4.5  algorithms  on  the  training  instance  dataset  resulted  in  decision 
trees  used  to  generate  production  rules.  Each  of  the  three  experiments  shows: 

•  A  2D  representation  of  the  instance  space  (i.e.,  grid)  that  indicates  which  instance  were  grouped 
together  by  data  clustering.  In  each  case,  the  system  classified  the  data  into  four  main  clusters  (cf. 
Figures  5,  8,  and  11). 

•  Two  3D  representations  for  the  salinity  (cf.  Figures  6,  9,  and  12)  and  the  temperature  (cf.  Figures 
7,  10,  and  13)  showing  graphically  the  role  of  each  cluster  in  terms  of  the  parameter  measured. 

•  The  decision  tree  obtained  with  C4.5  that  describes  logically  the  information  contained  in  the  previous 
pictures  (cf.  appendices  7.2,  7.4,  and  7.6). 

•  And  finally  the  production  rules  that  were  generated  from  the  decision  tree.  These  rules  describe  the 
findings  of  the  system  (cf.  appendices  7.3,  7.5,  and  7.7). 
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Figure  8;  Day  2:  Inatance  space  2D  representation 
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Figure  13;  Day  3;  Temperature  3D  representation 


2.5.3  Conclusion. 

These  experiments  show  that  from  raw  observational  data  contained  in  the  scientific  databases,  it  was 
possible  to  extract  knowledge  or  useful  information  hidden  in  the  data.  We  found  correlations  between 
salinity,  temperature,  and  depth  as  shown  in  the  generated  production  rules.  This  allows  one  to  define  some 
features  of  the  Gulf  Stream. 

We  are  currently  experimenting  the  learning  behavior  of  ADAS  when  the  system  considers  several  depths 
over  a  period  of  time  (several  days).  We  expect  to  find  promising  results  identifying  the  Gulf  Stream  features. 
Storing  these  rules  in  an  expert  system  and  using  domain  knowledge  (heuristics),  we  expect  the  system  to 
be  able  to  predict  Gulf  Stream  features  after  learning  prior  features  over  a  period  of  time. 

3  Interactive  Natural  Language  Interface 

The  ability  to  communicate  with  databases  in  a  natural  and  convenient  way  for  the  human  users  (i.e.,  using 
a  natural  language,  such  as  English)  is  of  great  interest  for  very  large  scientific  databases.  Extending  the  idea 
of  a  system  capable  to  only  handle  and  answer  single  query,  we  are  convinced  that  handling  a  conversation 
between  the  human  user  and  the  system  will  enable  the  machine  to  better  understand  and  refine  the  user’s 
intended  goals  or  needs.  By  conversation,  we  mean  an  interactive  and  intelligent  domain  specific  dialogue 
man-machine.  Using  a  natural  language  system  will  allow  to  explain  clearly  the  findinp  discovered  in  the 
scientific  databases,  ambiguities  in  the  query  formulation  can  be  resolve,  the  user  phrases  the  query  in  an 
incremental  manner  (i.e.,  query  refinement),  the  system  recognizes  query  intended  by  the  user  when  they 
are  incomplete,  and  it  is  application  independent. 

This  section  discusses  the  first  phase  in  the  design  and  development  of  ENTRETIEN,  a  computational 
simulation  model  of  domain  specific  conversation  or  discourse  in  natural  language.  Note  that  the  concept 
of  conversation  is  still  not  well  understood  and  modelized  neither  in  artificial  intelligence,  computational 
linguistics,  nor  cognitive  psychology.  ENTRETIEN  will  attempt  to  explain  how  a  domain  specific  conversa¬ 
tion  is  handled  and  used  by  implementing  its  processing.  It  will  serve  as  a  basis  computational  model  for 
effective  and  efficient  human-computer  interactions.  It  is  therefore  the  essential  element  of  the  interactive 
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natural  language  conversational  interface  of  the  Intelligent  Support  System  (ISS)  for  knowledge  discovery  in 
environmental  databases.  Note  that  the  present  study  only  considers  the  design  of  domain  specific  models, 
not  general  devices.  Note  also  that  changing  the  domain  knowledge  of  the  system  makes  it  able  to  convose 
in  various  and  different  domains,  or  databases. 

There  are  mainly  three  processes  involved  in  handling  a  natural  language  dialogue,  as  depicted  by  Fig¬ 
ure  14.  The  first  process  is  to  understand  the  user’s  input  utterance  (i.e.,  analysis  phase),  and  to  m^  it 
into  an  internal  representation  that  captures  its  intended  meaning.  Second  is  the  reasoning  and  planning 
process  which  decides  on  the  form  and  content  of  the  conversation.  For  example,  it  combines  the  current 
input  utterance  in  the  context  of  the  dialogue  (i.e.,  recognition  and  synthesis  of  discourse),  checks  whether 
knowledge  exists  in  the  knowledge  base  that  may  satisfy  the  user’s  needs,  decides  how  to  represent  and  accu¬ 
mulate  information,  and  decides  whether  and  what  to  communicate  next  to  the  user.  Third  is  the  synthesis 
process  that  generates  answers  to  the  user.  Thus,  the  issues  of  knowledge  representation,  understanding  and 
generating  natural  language  utterances,  how  the  system  modifies  its  knowledge  and  action  by  making  plans, 
emd  the  reasoning  process  will  be  explained. 


User's 

utterance 


Answer 


Figure  14:  Cognitive  tasks  involved  in  a  dialogue 

The  development  of  the  model  of  conversation  has  been  approached  by  the  following  three  steps: 

1.  In  step  1,  we  examined  in  depth  the  process  of  natural  language  understanding  and  processing,  including 
the  process  of  conversation  in  terms  of  artificial  intelligence,  computational  linguistics,  and  cognitive 
psychology.  The  natural  language  is  English.  The  better  the  model  is  similar  to  what  is  actually  known 
about  the  human  cognitive  processes  involved  during  a  dialogue,  the  better  it  will  be  able  to  understand 
and  communicate  with  human  users.  We  extracted  from  this  study  the  important  characteristics  the 
model  should  possess. 

2.  In  step  2,  we  surveyed  and  studied  existing  models  that  deal  with  natural  language  understanding  and 
processing.  This  was  done  to  decide  whether  to  use  one  of  them,  a  combination,  or  to  design  a  new  one, 
to  agree  with  the  necessary  characteristics  determined  at  phase  1.  We  also  decided  what  to  represent, 
and  how  to  represent  it. 

3.  Step  3  is  the  actual  phase  of  the  present  research,  in  which  we  are  developing  ENTRETIEN,  a  conver¬ 
sation  model  to  be  implemented  for  ISS.  The  model  will  be  developed  in  two  phases: 

•  Phase  1  is  the  design  and  developement  of  a  model  of  language  for  single  utteramce.  The  analysis 
and  synthesis  systems  to  be  designed  in  this  phase  process  natural  language  sentences.  The 
analysis  process  decodes  (i.e.,  parses)  the  user  ideas  and  needs  expressed  in  English  into  an 
internal  representation  manageable  by  the  intelligent  system.  Natural  language  processing  (NLP) 
makes  the  machine  understand  a  subset  of  natural  language  by  breaking  up  and  analyzing  the 
intended  meaning  of  the  input  sentence  (user’s  question,  etc.).  It  involves  four  levels  of  analysis: 
morphological,  syntactical,  semantics,  and  pragmatics.  The  purpose  of  any  language  is  to  enable 
communication  of  ideas  and  needs.  In  other  words,  it  is  a  means  for  a  speaker  who  has  something 
in  mind  that  he  wishes  the  recipient  to  understand  or  be  aware  of.  The  synthesis  process  is  merely 
the  reverse  of  the  analysis  process  and  is  used  to  generate  answers  to  the  user  in  a  natural  way. 

•  Phase  2  is  the  design  and  development  of  the  model  of  conversation  per  se.  A  conversation  is 
a  two-way  process  of  exchange  of  information  by  interrelated  sentences.  It  involves  at  'east  two 
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participants.  The  model  of  conversation  is  a  more  general  model  than  the  previous  one  designed 
in  phase  1.  In  fact,  it  contains  the  model  of  language  of  phase  1  (i.e.,  some  knowledge  about  the 
language  common  to  both  participants)  in  addition  to  some  knowledge  of  the  surrounding  world 
(i.e.,  a  representation  of  the  domain  knowledge),  some  knowledge  about  the  other  participant 
(a  model  of  what  the  participant  knows,  believes,  desires,  and  intends),  and  the  understanding 
and  representation  of  the  context  in  which  the  conversation  occurs.  In  this  phase,  the  system 
becomes  a  participant  in  an  interactive  and  intelligent  domain  specific  dialogue  with  the  human 
user,  making  an  evolution  from  the  previous  state  (phase  1)  in  where  it  was  only  a  simple  passive 
listener.  This  has  the  advantages  to  better  define  and  understand  the  user’s  needs  in  querying  the 
knowledge  base  and  to  explain  to  the  user  the  discoveries  contained  in  the  knowledge  base  and 
thus  not  explicitly  understandable  for  human.  The  model  will  include  the  mun  characteristics 
of  a  dialogue  such  as  to  infer  meaning  and  to  find  interrelation  between  successive  sentences 
(questions,  etc.)  of  the  same  conversation,  to  determine  whether  and  when  to  take  the  initiative 
in  the  dialogue,  etc. 

3.1  Processing  single  utterance. 

Processing  single  utterance  implies  understanding  the  input  sentence  (analysis  phase)  and  generating  an 
answer  to  the  user  (synthesis  phase).  The  natural  language  analysis  phase  breaks  up  and  analyzes  a  given 
sentence  to  determine  its  intended  meaning.  This  is  done  in  three  analysis  steps:  morphological,  syntactic, 
and  semantic  analysis.  Note  that  they  may  not  be  performed  in  that  order,  but  rather  each  process  is  used  as 
needed.  This  is  depicted  by  the  functional  diagram  given  in  Figure  15.  To  understand  dialogue,  the  system 
needs  to  have  the  following  kinds  of  knowledge: 

•  Morpbemics  knowledge  contains  the  rules  that  describe  how  units  of  meaning  (morphemes)  are  put 
together  to  form  words.  It  encodes  rules  for  plural,  verb  coiyugation,  etc. 

•  Syntactic  knowledge  contains  the  rules  of  the  grammar  that  describe  the  sentence  formation.  It  puts 
constraints  on  the  number  of  legal  sentences. 

•  Semantics  knowledge  contains  the  meaning  of  words  and  sentences,  hence  it  puts  constraints  on  the 
sentence  understanding. 

Our  computational  simulation  model  of  language  processing  will  contain  the  basis  for  the  analysis  and 
the  synthesis  processes.  The  scope  and  precision  of  the  language  will  enhance  and  grow  with  experience. 

3.1.1  Representing  knowledge  with  case  frames. 

Our  approach  used  case  frames  to  represent  the  knowledge  about  the  surrounding  world.  Originally  developed 
by  Fillmore  [Fil68],  it  is  still  under  active  research  tuid  widely  used  [CH92].  It  is  a  metms  to  organize  efficiently 
large  amounts  of  knowledge  used  for  cognitive  tasks  [BF81].  The  fact  that  it  is  recursive  by  nature,  and  that 
it  combines  ”  bottom-up  recognition  of  key  constituent”  (i.e.,  solve  problem  of  goal  direction)  "with  top-down 
instantiation  of  less  structured  constituents”  [CH92],  makes  it  very  attractive  for  computational  simulation 
of  natural  language  processing. 

A  case  frame  contains  a  bead  concept  (a  verb)  and  a  set  of  semantic  roles  (or  cases).  Roles  may  be 
optional  and  may  contain  default  values;  this  corresponds  to  the  case  when  the  system  infer  expectation  on 
the  meaning  of  the  2ictual  situation.  A  frame  contains  slots  to  be  filled  according  to  the  information  provided 
by  the  dialogue.  Slots  correspond  to  the  type  of  information  that  must  be  found  in  the  conversation  to  be 
understandable.  Links  between  frames  represent  additional  knowledge  required  by  the  context  of  the  current 
situation.  Frames  are  dynamic  entities,  and  thus  may  be  expanded  on  the  basis  of  new  experience  throughout 
the  dialogue. 

Case  frame  is  a  convenient  way  to  combine  both  declarative  and  procedural  knowledge.  Declarative 
knowledge  is  a  standard  data  structure  which  contains  objects,  properties  of  the  objects,  and  relations 
among  objects.  This  corresponds  to  the  declarative  form  of  the  frame  in  which  slots  are  filled  according  to 
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Figure  15:  Analysis  and  synthesis  process 


information  inferred  from  the  input  sentence.  In  this  case,  knowledge  is  accessed  using  descriptions  of  the 
data.  Procedural  kaowiedge  is  not  explicitly  stated.  It  corresponds  to  knowledge  used  when  executing  a 
function.  This  is  implemented  by  allowing  slots  to  functions  used  only  if  the  slots  is  in  use.  While  declarative 
knowledge  is  more  efficient,  easily  accessed  and  modified,  procedural  knowledge  allows  more  flexibility. 

3.1.2  Case  grammar  theory. 

Case  theory  is  an  efficient  formalism  to  represent  sentence  meaning  from  its  concepts.  The  case  structure  has 
been  traditionally  studied  in  linguistics.  The  new  interest  for  case  pammar  is  due  to  Fillmore  [FU68]  who 
intended  to  generalise  the  transformational  generative  pammar  theory.  We  follow  Fillmore  deep  structure 
definitions  of  case  pammar  as  the  basis  for  our  system  [Fil68]  [Fil71].  Deep  case  is  mainly  a  classification 
of  the  noun  phrases  present  in  a  sentence,  and  is  in  accordance  with  their  conceptual  roles  in  the  action 
described  by  the  sentence  [CH92].  Note  that  only  a  small  finite  set  of  conceptual  roles  is  necessary  to 
represent  the  meaning  of  any  sentence  [BM92]  [CH92]. 

Cases  .  Fillmore  defined  the  following  set  of  semantic  roles  (from  [BM92]  and  [FilTl]): 


Agent 

The  investigator  of  the  event,  an  animate  being 

Counter  agent 

The  force  or  resistance  against  which  the  action  is  carried  out 

Object 

The  entity  that  moves  or  changes  or  whose  position  or  existence 
is  in  consideration 

Result 

The  entity  that  comes  into  existence  as  a  result  of  the  event 

Instrument 

The  stimulus  or  immediate  physical  cause  of  an  event 

Source 

The  place  from  which  something  moves 

Goal 

The  place  to  which  something  moves 

Experience 

The  entity  that  recieves  or  accepts  or  experiences  or 
undergoes  the  effect  of  an  event 

Case  frames.  Each  verb  of  the  language  is  represented  by  a  case  frame.  This  represents  the  expectation 
of  the  event  denoted  by  that  verb.  Cases  may  be  optional,  required,  or  not  needed.  A  case  frame  consists  of 
two  components  [CH92]: 
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•  Case  struetura:  set*  of  caaet  that  play  a  role  in  the  event  deectibed  by  the  verb  the  phraae.  Each 
slot  conauta  of  both  a  filler  and  a  lexical  marker. 

•  Selection  reatriction:  aemantic  conatrainta  on  the  choice  of  relevant  caaea.  For  example,  afoit  may 
be  required  to  be  an  animate  object  to  consider  that  caae,  etc. 

Case  grammar  definition.  The  following  defines  (after  Fillmore)  the  case  grammar  rules: 

S=>M  +  P 
M  ^all  modess 
P^V  +  Ci...Cn 
V  =>all  verbs  of  the  language 
Ci=>K  +  NP 
K  ^  (PREP) 

NP  =>  (PREP)  +  {DET)  +  (ADJ\N)  *+N  +  iS\NP) 


A  sentence  is  mainly  composed  of  a  modality  M  and  a  proposition  P.  P  is  a  verb  V  and  a  set  of  cases  Q. 
Each  Ci  is  a  Kasus  K  and  a  noun  phrase  NP.  K  is  a  preposition  introducing  a  noun  phrase;  it  may  not  be 
present. 

Modality.  The  mode  is  a  general  description  of  a  sentence.  Simmons  [Sim73]  defined  the  following  mode 
values  (from  [Har85]): 


Values 

Tense 

Aspect 

Form 

Mood 

Essence 

Modal 

Manner 

Time 

Indicate  time  orientation 

Say  whether  the  event  is  continuing 
Information  about  the  verb  phrase 

Relates  to  the  elements  order 

Say  whether  verb  phrase  is  positive 

Helping  verb  adding  features  to  main  verb 
Indicated  by  adverbial  part  of  senetence 
Indicated  by  adverbial  part  of  senetence 

present,  pas,  future 

perfect,  imperfect 

simple,  emphatic,  progressive 

declarative,  interrogative,  imperative 

positive,  negative,  indeterminate 

may,  can,  must 

adverbial 

adverbial 

3.1.3  Parsing  with  case  frames. 

During  the  analysis  process,  parsing,  cases  of  a  frame  must  be  mapped  to  the  noun  phrase.  For  the  synthesis 
phase  (sentence  generation),  the  concept  to  express  is  mapped  into  the  case  frame  of  interest.  In  both  cases, 
whether  a  slot  is  ooptional,  required,  or  not  needed  decides  on  what  to  map.  The  lexicon  is  used  to  access 
the  word  semantics  knowledge  to  decide  which  case  to  fill  in. 

Case  frame  parsing  algorithm.  Carbonell  and  Hayes  [CH92]  proposes  the  following  algorithm  for  parsing 
with  case  frames. 

•  Browse  ail  case  frame  to  match  input.  If  there  exist  some  csndidate(s)  then  continue  to  next  step,  else 
stop  since  the  parser  cannot  parse  that  input. 

•  Attempt  to  recognize  in  the  input  sentence  each  case  of  the  selected  case  frame. 

1.  If  case  is  marked  lexically,  do  a  match  for  the  case  marker. 

2.  If  the  case  is  marked  positionally,  do  a  match  of  case  filler. 

3.  If  case  could  be  either  way,  search  first  for  lexical  marker,  then  positional  marker  if  failed. 

•  Attempt  to  recognize  all  optional  cases. 


18 


3.1.4  Lexicon. 


The  dictionary  is  a  data  structure  storing  knowledge  about  the  language.  Chunk  of  knowledge  are  represented 
with  frames  and  scripts.  The  lexicon  is  mainly  used  during  the  parsing  process  to  access  syntactic  and 
semantic  information  about  the  words  of  the  language.  Note  that  it  gives  only  meaning  of  words  crmsidered 
individually.  The  lexicon  also  stores  the  case  frames  for  each  verb  of  the  language  as  discussed  previously. 

For  storage  space  purposes,  the  lexicon  only  stores  the  roots  of  the  words.  Related  words  formed  on 
those  roots  are  determined  from  the  information  contained  in  each  entry  word  (whether  and  which  affixes 
are  accepted)  and  during  the  morphological  processing.  This  also  saves  search  time. 

Implementation.  We  intend  to  use  the  expert  system  shell  CLIPS  6.0  to  develop  the  lexicon.  The  choice  is 
justified  by  the  excellent  Rete  algorithm  implemented  in  CLIPS,  which  is  efficient  in  organizing  and  seraching 
for  dictionary  entries. 

3.2  Processing  a  conversation. 

This  section  describes  the  design  of  a  computational  simulation  model  of  conversation,  based  on  human 
cognitive  models,  to  conduct  intelligent  and  interactive  dialogues  with  the  user  to  better  exploit  information 
from  the  knowledge  base  and  to  explain  in  a  natural  way  findings  that  have  been  discovered  in  the  scientific 
databases.  A  conversation,  or  diaJogue,  is  an  extremely  complex  process  [Pop86]  [Har85]  [ErH93]  that  is 
still  not  well  understood  and  modelized  in  artificial  intelligence,  computational  linguistics,  and  cognitive 
psychology.  The  following  proposes  an  approach  in  designing  such  a  model  of  conversation,  limited  to 
discourses.  Discourses  are  domain  specific  conversations.  A  conversation  is  an  interactive  two-way  process 
of  exchange  of  information  by  a  succession  of  interrelated  sentences.  It  involves  at  least  two  participants,  the 
present  case  being  limited  to  two  participants:  a  human  user  and  the  machine.  The  machine  will  interpret 
and  refine  the  user’s  needs  in  an  interactive  manner. 

3.2.1  General  approach. 

The  model  is  based  on  the  model  of  the  language  discussed  in  the  previous  section  to  which  are  added  two 
more  important  features  (cf.  Figure  16): 

•  Pragmatics  knowledge  contains  the  rules  for  the  conversation. 

•  Reasoning  and  planning  process. 


3.2.2  Main  characteristics. 

The  process  of  conversation  should  have  the  following  main  cbuacteristics  to  interpret  the  meaning  of  the 
dialogue. 

•  Tacit  knowledge.  The  model  should  perceive  more  information  that  what  is  given  by  the  sentence. 
Hence,  it  should  get  tacit  knowledge  (i.e.,  presuppositions),  and  connection  between  sentence  (i.e., 
implication  and  inference).  Each  sentence  conveys  additional  indirect  information  that  should  be 
transformed  as  explicit  information.  This  may  be  done  by  analyzing  every  word,  its  use  and  relation 
to  other  words  in  the  sentence,  as  well  as  previous  knowledge  acquired  from  prior  sentences  since  the 
beginning  of  the  conversation. 

•  Approximate  answers.  When  a  participant  X  converses  with  another  participant  Y  (via  questions, 
etc.),  X  anticipates  Y’s  answer.  In  the  case  that  Y  does  not  know,  or  is  not  sure  about  the  answer, 
Y  generates  an  approximate  or  indirect  answer  [Pop86].  Human  always  generate  such  answers,  thus 
the  model  should  be  able  to  handle  the  type  of  answer  that  a  human  user  may  generate.  Approximate 
answers  are  used  when  a  precise  answer  to  the  given  question  gives  no  information,  the  answer  is  not 
known,  and  when  the  system  ask  to  clarify  words  it  did  not  understand. 
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Figure  16:  Case  frame  data  atructure 


•  Vaambiguota  statements.  Statements  generated  by  a  participant  should  be  unambiguous  in  terms  of 
syntax  and  semantics.  Each  participant  should  have  knowledge  about; 

-  a  common  language  to  conduct  a  conversation  (i.e.,  English). 

-  the  surrounding  world,  i.e.,  the  domain  knowledge,  such  as  oceanography. 

-  the  context  in  which  the  conversation  occurs. 

-  the  other  participant  to  help  infer  and  deduce  indirect  sentences,  intended  meanings,  etc. 

•  Anaphoric  references.  To  handle  a  dialogue,  the  mode)  should  be  able  to  recall  references  to  previous 
sentences  to  understand  the  previous  sentence. 

•  Ellipsis.  The  model  should  be  able  to  determine  the  meaning  of  slightly  incomplete  sentences. 

•  Initiative  in  the  conversation.  When  needed,  the  system  should  take  the  initiative  in  the  conversation, 
and  not  be  only  a  passive  question  answering  machine.  This  is  done  to  clarify  points,  to  refine  the 
user’s  needs,  etc. 

•  Synonyms  and  homonyms  handling.  The  representation,  in  terms  of  meaning,  of  synonym  phrases 
should  be  identical,  in  order  to  deduce  the  same  meaning. 

•  Depth  of  understanding. 

3.2.3  Computational  model  of  the  user’s  plans. 

In  a  natural  language  information-seeking  dialogue,  human  beings  infer  plans  and  goals  of  the  speaker. 

Information  seeking  dialogue  is  a  complex  two-way  process  that  involves  two  participants,  in  wldch  the 

information-seeker  (IS)  is  the  human  user  and  the  information-provider  (IP)  is  the  computer  system.  We  want 
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to  base  our  system  on  a  previous  simulation  model  developed  by  [Car90]  which  seems  the  most  appropriate 
actual  system  for  further  enhancements  towards  a  more  robust  system  that  will  be  a  step  closer  to  the  human 
way  of  handling  dialogues.  Our  goal  is  to  palliate  an  obvious  lack  of  efficient  and  effective  communication 
between  the  human  user  seeking  information  and  the  computer  system  willing  to  provide  such  information. 
Human  cognition  can  provide  an  inspiring  model  for  the  design  of  knowledge  systems.  This  is  a  very  on-gmng 
complex  task  that  many  psychologists,  linguists,  and  artificial  intelligence  computer  scientists  are  trying  to 
accomplish  for  several  years.  We  focused  on  the  pian  recognition  aspect  of  the  model,  and  more  specifically 
on  how  to  relate  a  new  utterance  to  the  actual  model  built  from  previous  utterances. 

IS  recognition  of  plan.  A  conversation,  or  dialogue,  is  a  planned  activity.  Plans  and  goals  represent  the 
participant’s  actual  beliefs  rmd  expectations  that  are  used  to  make  the  next  statement  in  the  conversation 
(e.g.,  answer  to  a  question,  transfer  of  information  from  one  person  to  another,  etc.)  [Csr88].  During 
a  dialogue,  the  system  (IP)  dynamically  analyzes  each  utterance  to  understand  the  user  (IS)  underlying 
intentions  to  generate  better  answers.  This  is  done  by  searching  a  space  of  plausible  goals  (i.e.,  partial  plans) 
related  to  the  actual  dialogue  context.  In  fact,  the  system  tries  to  guess  the  most  plausible  goal  to  represent 
the  actual  IS  model.  To  understand  a  discourse  requires  one  to  infer  the  intentions  and  beliefs  of  IS.  This 
requires  one  to  derive  a  plan  with  a  sequence  of  actions  to  achieve.  This  idea  of  making  strategies  to  satisfy 
a  goal  has  been  exploited  for  decades  (e.g.,  [Est78][And80]).  Allen  and  Perrault  [AP80]  developed  a  system 
that  derives  intentions  behind  surface  speech  acts  [^a70],  using  knowledge  about  the  plans  and  beliefs  about 
the  speaker’s  goals.  In  his  work,  plan  inference  is  determine  by  first  order  predicate  logic  rules  and  control 
strategy.  Searle’s  approach  to  formulate  speech  acts  was  based  on  the  idea  that  human  beings  plan  the 
generation  of  utterance  in  natural  language  dialogue,  so  we  must  do  a  model  that  simulates  human  behavior 
in  lainguage  comprehension.  This  idea  was  later  formalized  by  Allen  as  stated  above  [A1183]  [AP80]and  also 
by  Cohen  and  Perrault  [CP79].  The  set  of  possible  plans  are  represented  as  a  hierarchical  structure,  a  tree, 
in  which  like  the  human  memory  is  an  organization  with  a  top-down  recall  scheme.  The  more  the  plan  is 
near  a  leaf,  the  less  likely  it  will  be  considered  next  since  the  less  easier  to  retrieve  from  the  structure  (i.e., 
it  takes  more  effort  to  retrieve  deep  nodes).  Thorndyke  was  the  first  to  address  the  problem  of  memory 
representation  of  connected  utterances  [Tho77]. 

Plan  recognition  in  dialogue.  The  main  issue  is  the  focus  of  attention  in  the  dialogue.  Focus  is  the 
process  of  main  concentration  of  one  participant  on  the  other  participant  knowledge  base  [Gro81].  This  is 
related  to  the  actual  discourse  context.  Due  to  the  large  size  of  one's  knowledge  base,  they  rather  concentrate 
on  the  actual  space  of  the  knowledge  base  that  agrees  with  the  dialogue  current  context.  McKeown  further 
developed  Grosz  theory  by  claiming  that  when  a  speaker  has  the  choice  of  the  topic  of  the  discussion,  this 
speaker  always  chooses  the  most  recently  introduced  topic,  if  he  has  still  something  to  say  about  it  [McK83]. 
This  is  the  basis  of  Carberry’s  theory  on  the  natural  organization  of  the  human  dialogue  [Car90]. 

Objectives.  We  would  like  to  develop  a  robust  computational  dynamic  model  of  the  information-seeker 
(IS)  plans  and  goals  during  an  information-seeking  type  natural  dialogue.  To  do  so,  we  would  like  to  improve 
an  existing  model  developed  by  Cuberry  [Car90]  (that  seems  most  promising  for  enhancements,  by  relaxing 
actual  restrictive  constraints)  towards  a  more  robust  system  of  the  second  generation  of  plan  recognition 
systems.  The  issues  of  the  present  study  refer  then  to  several  fields,  and  therefore  the  criteria  of  success 
of  the  cognitive  simulation  model  will  be  quite  different  for  each  field  [Kie92].  For  Artificial  Intelligence,  in 
contributing  towards  the  construction  of  more  intelligent  machines,  and  more  specifically  to  render  machines 
able  to  communicate  in  natural  language  with  their  human  users;  For  Cognitive  Psychology,  in  attempting  to 
advance  explanations  on  some  cognitive  processes  involved  during  dialogue,  such  as  utterance  understanding 
and  processing,  comprehension,  inference,  knowledge  representation  in  memory  for  information  retrieval,  and 
human  problem  solving  and  planning;  For  Computational  Linguistics,  in  implementing  theories  on  natural 
language  that  implies  to  specify  details  necessary  for  a  computer  to  handle,  but  not  necessary  evident  in 
complex  theories.  Hence,  we  will  center  this  study  among  the  above  fields,  in  reverse  to  the  general  tendency 
to  conduct  studies  in  the  boundaries  of  a  single  field. 
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4  Future  Directions 


Thu  section  proposes  the  future  directions  we  coiuidered  to  follow  in  designing  and  implein«iting  the  com¬ 
plete  system  (Figure  1). 

4.1  Determining  the  number  of  clusters. 

Recall  that  the  data  clustering  process  produces  a  hierarchical  tree  in  where  instances  are  classified  according 
to  some  distance  measure.  Deciding  where  to  cut  the  tree  to  have  the  best  number  of  clusters  is  an  issue 
not  yet  well  formalised  in  statistics.  In  the  future,  we  plan  to  follow  an  ^proach  inspired  by  Romesburg 
[Rom84],  and  described  in  [PP94],  that  seems  to  be  moat  similar  to  the  way  human  beings  select  the  number 
of  clusters  during  taxonomic  cognitive  processes. 

4.2  Data  clustering  algorithms. 

We  intend  to  use  more  computationally  efficient  data  clustering  algorithms,  such  those  contained  in  the 
ClusPak  package  developed  by  Besdek  and  Pal  [BP92b]  [BPd2a].  Future  tasks  include  to  develop  an  interface 
between  the  system  and  this  package. 

4.3  Incremental  Updating  of  the  Knowledge  Base  System  Design. 

The  system  for  incrementally  updating  the  knowledge  base  (lUKBS)  is  intended  to  combine  every  new  set 
of  production  rules  that  describe  the  new  data  inserted  in  the  databases  with  the  previous  rules  contained  in 
the  knowledge  base  of  an  expert  system.  The  intent  is  to  keep  pace  with  the  changes  that  occur  in  the  data, 
since  large  scientific  databases  are  dynamic  entities.  The  issues  in  updating  incrementally  the  knowledge 
base  involve  cheeking  for  rules  consistency,  i.e.,  to  carefully  manage  the  database  of  production  rules  to 
update  accordingly  every  other  rule  that  is  modified,  either  directly  or  indirectly,  each  time  a  new  rule  is 
added  or  removed. 

4.4  Use  of  an  expert  system  shell. 

Since  the  main  goal  is  to  use  the  knowledge  base  for  information  retrieval  (answering  user’s  queries)  and  to 
infer  new  knowledge  from  the  data,  it  is  of  interest  to  transform  the  production  rules  generated  by  Quinlan’s 
C4.5rules  into  CLIPS  format.  CLIPS  is  an  expert  system  shell  developed  by  NASA  [NAS91].  This  will 
allow  us  to  combine  them  with  the  domain  knowledge  (environmental  definitions,  etc.)  already  captured  in 
a  frame-based  semantic  network  implemented  using  CLIPS  during  a  previous  work  [PP94]. 

4.5  Implementing  ENTRETIEN. 

This  concerns  implementing  every  process  discussed  in  the  natural  language  interface  system  (cf.  Section 
3). 

5  Conclusion 

This  research  clearly  shows  that  combining  machine  learning  techniques  (data  clustering  and  induction 
learning  by  decision  trees)  with  production  rules  is  an  effective  mean  for  automatic  data  analysis  of  huge 
and  rapidly  changing  aunounts  of  observational  raw  data  in  scientific  databases.  In  terms  of  expert  ssrstems, 
automatic  rule  generation  from  non-trivial  knowledge  contained  in  scientific  databases  eases  the  knowledge 
acquisition  bottleneck  that  usually  exists  in  extracting  knowledge  from  human  experts.  Further  developments 
of  the  prototype  seems  promising  in  solving  the  problem  of  automatic  data  analysis  for  information  retrieval 
in  scientific  databases. 

We  believe  that  constructing  the  natural  language  interface,  described  in  this  report,  in  which  the  com¬ 
puter  ’’speaks”  the  same  language  as  the  human  user,  is  of  great  interest  to  efficiently  exploit  information 
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non-triviaily  stored  io  databases.  We  want  to  go  a  step  further  in  building  an  intelligent  machine  capa¬ 
ble  of  communicating  efficiently,  as  human  do  naturally.  Language  is  the  support  of  information  transfer. 
Communication  is  then  the  underlying  support  of  intelligence. 
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7  Appendices 

7.1  Gain  criterion 

The  notion  of  gain{X)  mentioned  in  the  section  about  the  decision  tree  refers  to  the  gain{X)  function  used 
in  the  gain  criterion. 

After  T  has  been  partitioned  with  respect  to  the  n  outcomes  of  a  test  X,  the  expected  information 
requirement  can  be  found  as  the  weighted  sum  over  the  subsets  as  infor{T)  =  x  info(T) 

Essentially,  info(T)  measures  the  avery  amount  of  information  needed  to  identify  the  class  of  a  case  in 
T.  info{T)  =  -T,U  X  1°83 

Therefore,9atn(X)  =  info{T)  -  tn/oc(T)meaaures  the  information  gained  by  partitioning  T  according 
to  the  test  X. 

7.2  Day  1:  Decision  tree  obtained  for  the  clusters’  description 

Read  16000  cases  (6  attributes)  froa  tsl.data 
Decision  Tree; 
teaperature  <-  29.0741  : 

I  teaq>erature  <-  28.074  : 

I  I  salinity  <»  36.1638  :  C2  (26.0/1.0) 

I  I  salinity  >  36.1638  : 

I  I  I  salinity  >  36.1699  :  C3  (6892.0/1.0) 

I  I  I  salinity  <-  36.1699  : 

I  I  I  I  teaperature  <=  28.0632  :  C3  (17.0) 

I  I  I  I  teaperature  >  28.0632  :  C2  (8.0) 

I  teaperature  >  28.074  : 

I  I  teaperature  <s  28.0792  : 

I  I  I  salinity  <»  36.2189  :  C2  (22.0) 

I  I  I  salinity  >  36.2189  :  C3  (20.0) 

I  I  teaperature  >  28.0792  : 

I  I  I  teaperature  <s  28.0917  : 
till  salinity  <=  36.2798  :  C2  (61.0) 

I  I  I  I  salinity  >  36.2798  : 

I  I  I  I  I  salinity  >  36.3016  :  C3  (19.0) 

I  I  I  I  I  salinity  <s  36.3016  : 

I  I  I  I  I  I  tei^erature  <s  28.087  :  C3  (6.0) 

I  I  I  I  I  I  teaperature  >  28.087  : 

I  I  I  I  I  I  I  salinity  <»  36.2903  :  C3  (2.0) 

I  I  I  I  I  I  I  salinity  >  36.2903  :  C2  (3.0) 
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I  I  I  t«ap«ratiir«  >  28.0917  : 

I  I  I  I  t«iip«ratiir«  >  28.0948  :  C2  (8496.0/1.0) 

I  I  I  I  twparator*  <•  28.0948  : 

I  I  I  I  I  colium  >  34  :  C2  (14.0) 

I  I  I  I  I  coluan  <s  34  : 

I  I  I  I  I  I  aalinity  >  36.3186  :  C2  (6.0) 

I  I  I  I  I  I  salinity  <-  36.3186  : 

I  I  I  I  I  I  I  ts^>aratars  <«  28.0936  :  C3  (3. 0/1.0) 

I  I  I  I  I  I  I  taatparatnra  >  28.0936  :  C2  (3.0) 

tanparatnra  >  29.0741  : 

I  tsaparattira  <»  29.8921  :  Cl  (262.0/1.0) 

I  tsaparattirs  >  29.8921  : 

I  I  tsaparature  >  29.9348  :  CO  (2121.0) 

I  I  teaparatura  <«  29.9348  : 

I  I  I  colnan  <=  86  :  Cl  (9.0) 

I  I  I  colnan  >  86  :  CO  (22.0) 

Siapliiiad  Dacision  Traa: 
taaparatnra  <=  29.0741  : 

I  taaparatnra  <=  28.074  : 

I  I  salinity  <=  36.1638  :  C2  (25.0/2.6) 

I  I  salinity  >  36.1638  : 

I  I  I  salinity  >  36.1699  :  C3  (5892.0/2.6) 

I  I  I  salinity  <-  36.1599  : 

I  I  I  I  taaparatnra  <=  28.0632  :  C3  (17.0/1.3) 

till  taaparatnra  >  28.0632  :  C2  (8. 0/1. 3) 

I  taaparatnra  >  28.074  : 

I  I  taaparatnra  <«  28.0792  : 

I  I  I  salinity  <»  36.2189  :  C2  (22.0/1.3) 

I  I  I  salinity  >  36.2189  :  C3  (20.0/1.3) 

I  I  taaparatnra  >  28.0792  : 

I  I  I  taaparatnra  >  28.0917  :  C2  (6621.0/5.1) 

I  I  I  taaparatnra  <s  28.0917  : 

I  I  I  I  salinity  <=  36.2798  :  C2  (61.0/1.4) 

I  I  I  I  salinity  >  36.2798  : 

I  I  I  I  I  salinity  >  36.3016  :  C3  (19.0/1.3) 

I  I  I  I  I  salinity  <-  36.3016  : 

I  I  I  I  I  I  taaparatnra  <-  28.087  :  C3  (6. 0/1. 2) 

I  I  I  I  I  I  taaparatnra  >  28.087  : 

I  I  I  I  I  I  I  salinity  <=  36.2903  :  C3  (2. 0/1.0) 

I  I  I  I  I  I  I  salinity  >  36.2903  :  C2  (3.0/1. 1) 

taaparatnra  >  29.0741  : 

I  taaparatnra  <s  29.8921  :  Cl  (252.0/2.6) 

I  taaparatnra  >  29.8921  : 

I  I  taaparatnra  >  29.9348  :  CO  (2121.0/1.4) 

I  I  taaparatnra  <-  29.9348  : 

I  i  I  colnan  <«  86  :  Cl  (9. 0/1. 3) 

I  I  I  colnan  >  86  :  CO  (22.0/1.3) 

Evaluation  on  training  data  (16000  itaas): 


Bafora  Pruning 


Altar  Pruning 


Siz«  Errors  Siss  Errors  Estiaats 
39  6(  O.OX)  31  6(  0.0%)  (  0.2%)  « 

7.3  Day  1:  Production  rules  generated  from  the  decision  tree 

Road  16000  casos  (6  attributos)  irom  tsl 
Processing  troe  0 
Final  miss  Irosi  tree  0: 

Rnle  19:  testperatnre  >  29.9348 
->  class  CO  [99.9%] 

Rnle  18:  teaperatnre  >  29.8921 
colnan  >  86 
->  class  CO  [99.9%] 

Rule  17:  temperature  >  29.0741 
temperature  <s  29.9348 
column  <-  86 
->  class  Cl  [99.4%] 

Rule  16:  temperature  >  29.0741 
teaperatnre  <-  29.8921 
->  class  Cl  [99.0%] 

Rule  2:  salinity  >  36.1638 
temperature  <-  28.0632 
->  class  C3  [100.0%] 

Rule  6:  salinity  >  36.2189 
temperature  <-  28.0792 
->  class  C3  [100.0%] 

Rule  8:  salinity  >  36.2798 
temperature  <-  28.087 
->  class  C3  [100.0%] 

Rule  11:  salinity  >  36.3016 
temperature  <=  28.0917 
->  class  C3  [100.0%] 

Rule  4:  salinity  >  36.1699 
temperature  <=  28.074 
->  class  C3  [100. 0%] 

Rnle  7:  salinity  <-  36.2798 
temperature  >  28.0792 
temperature  <=  29.0741 
->  class  C2  [100.0%] 

Rnle  14:  temperature  >  28.0917 
temperature  <-  29.0741 
->  class  C2  [99.9%] 

Rule  6:  salinity  <»  36.2189 
temperature  >  28.074 
temperature  <-  29.0741 
->  class  C2  [99.9%] 

Rule  10:  salinity  <»  36.3016 
temperature  >  28.087 
temperature  <-  29.0741 
->  class  C2  [99.9%] 

Rule  3:  salinity  <s  36.1699 
temperature  >  28.0632 


25 


r 


twpwrator*  <>  29.0741 
->  elu«  C2  [99. 9X] 

Rnl*  1:  lallnity  <»  36.1638 
twpcr&tura  <•  29.0741 
->  clus  C2  [99.8%] 

Datault  class:  C2 

Evaluation  on  training  data  (16000  itasis): 


Rnls 

Sizs 

Error 

Ussd 

Wrong 

Advantags 

19 

1 

0.1% 

2121 

0  (0.0%) 

66  (6610) 

CO 

18 

2 

0.1% 

22 

0  (0.0%) 

22  (22)0) 

CO 

17 

3 

0.6% 

238 

0  (0.0%) 

9  (910) 

Cl 

16 

2 

1.0% 

23 

1  (4.3%) 

22  (2210) 

Cl 

2 

2 

0.0% 

6861 

0  (0.0%) 

17  (1710) 

C3 

6 

2 

0.0% 

66 

0  (0.0%) 

6  (610) 

C3 

8 

2 

0.0% 

17 

0  (0.0%) 

6  (610) 

C3 

11 

2 

0.0% 

8 

0  (0.0%) 

8  (810) 

C3 

4 

2 

0.0% 

13 

1  (7.7%) 

11  (1211) 

C3 

7 

3 

0.0% 

6269 

1  (0.0%) 

0  (010) 

C2 

14 

2 

0.1% 

1323 

2  (0.2%) 

0  (010) 

C2 

6 

3 

0.1% 

22 

0  (0.0%) 

0  (010) 

C2 

10 

3 

0.1% 

6 

2  (40.0%) 

0  (OlO) 

C2 

3 

3 

0.1% 

27 

0  (0.0%) 

0  (OlO) 

C2 

1 

2 

0.2% 

6 

1  (16.7%) 

0  (OlO) 

C2 

Tsstsd 

16000,  srrors  8  (0.1%)  « 

(a) 

(b) 

(c)  (d)  <-classilisd  as 

2143 

1 

(a): 

class  CO 

260 

1 

(b): 

class  Cl 

6636 

1  (c): 

class  C2 

6  6963  (d): 

class  C3 

7.4  Day  2:  Decision  tree  obtained  for  the  clusters’  description 

Road  16000  casos  (6  attributss)  froB  t82.data 
Dscision  Trss: 
tss^sraturo  <-  28.9968  : 

I  tsaparaturs  <-  28.1988  : 

I  I  taaparaturs  <»  28.1669  :  C3  (6298.0/1.0) 

I  I  tsaparaturs  >  28.1669  : 

I  I  I  salinity  <=  36.2868  : 

I  I  I  I  salinity  >  36.2067  :  C2  (67.0) 

I  I  I  I  salinity  36.2067  : 

I  I  I  I  I  salinity  >  36.1474  :  C3  (9.0) 

I  I  I  I  I  salinity  <-  36.1474  : 

I  I  I  I  I  I  salinity  <=  36.1296  :  C2  (66.0) 

I  I  I  I  I  I  salinity  >  36.1296  : 

I  I  I  I  I  I  I  tsaparaturs  <s  28.1838  :  C3  (8. 0/1.0) 

I  I  I  I  I  I  I  tsaparaturs  >  28.1838  :  C2  (8.0) 

I  I  I  salinity  >  36.2868  : 
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till  tM^wratvra  <«  28.19S7  :  C3  (107.0) 

I  I  I  I  tM^aratur*  >  28.1967  : 

I  I  I  I  I  aalinity  <»  36.3182  :  C2  (3.0) 

I  I  I  I  I  salinity  >  36.3162  :  C3  (10.0) 

I  taaparatnra  >  28.1988  : 

I  I  tai^ratnra  >  28.2035  :  C2  (6227.0) 

I  I  taaparatnra  <«  28.2036  : 

I  I  I  salinity  <»  36.318  :  C2  (22.0) 

I  I  I  salinity  >  36.318  :  C3  (3.0) 

taaparatnra  >  28.9968  : 

I  tastparatnra  <>  29.9177  :  Cl  (243.0) 

I  taaparatnra  >  29.9177  : 

I  I  taaparatnra  >  29.9634  :  CO  (2886.0) 

I  I  taaq>aratnra  <=  29.9634  : 

I  I  I  salinity  <-  36.0446  :  CO  (34.0/1.0) 

I  I  I  salinity  >  36.0446  :  Cl  (19.0) 

Sisqpliliad  Daeision  Traa: 
taaparatnra  <»  28.9968  : 

I  taaparatnra  <=  28.1988  : 

I  I  taaparatnra  <s  28.1669  :  C3  (6298.0/2.6) 

I  I  taaparatnra  >  28.1669  : 

I  I  I  salinity  <-  36.2868  : 

lilt  salinity  >  36.2067  :  C2  (67.0/1.4) 

till  salinity  <=  36.2067  : 

I  I  I  I  I  salinity  <«  36.1296  :  C2  (66.0/1.4) 

I  I  I  I  I  salinity  >  36.1296  : 

I  I  I  I  I  I  taaparatnra  <«  28.1838  :  C3  (16.0/2.6) 

I  I  I  I  I  I  taaparatnra  >  28.1838  :  C2  (9. 0/2. 4) 

I  I  I  salinity  >  36.2868  : 

till  taaparatnra  <-  28.1967  :  C3  (107.0/1.4) 
till  taaparatnra  >  28.1967  : 

I  I  I  I  I  salinity  <=  36.3162  :  C2  (3.0/1. 1) 

I  I  I  I  I  salinity  >  36.3162  :  C3  (10.0/1.3) 

I  taaparatnra  >  28.1988  : 

I  I  taaparatnra  >  28.2036  :  C2  (5227.0/1.4) 

I  I  taaparatnra  <»  28.2035  : 

I  I  I  salinity  <«  36.318  :  C2  (22.0/1.3) 

I  I  I  salinity  >  36.318  :  C3  (3.0/1. 1) 

taaparatnra  >  28.9968  : 

I  taaparatnra  <»  29.9177  :  Cl  (243.0/1.4) 

I  taaparatnra  >  29.9177  : 

I  I  taaparatnra  >  29.9634  :  CO  (2886.0/1.4) 

I  I  taaparatnra  <»  29.9634  : 

I  I  I  salinity  <=  36.0446  :  CO  (34.0/2.6) 

I  I  I  salinity  >  36.0445  :  Cl  (19.0/1.3) 


Evalnation  on  training  data  (15000  itaas): 
Balora  Pmning  Aftar  Pmning 
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Siz«  Errors  Sis#  Errors  Estiuto 
31  3(  O.OX)  2®  4(  O.OX)  (  0.2X)  « 

7.5  Day  2:  Production  rules  generated  from  the  decision  tree 

Road  15000  casos  (S  attributos)  froa  ts2 
Procsssiag  trss  0 
Final  miss  froa  trss  0: 

Ruls  13:  tsapsratnrs  >  28.9968 
tsapsratnrs  <>  29.9177 
->  class  Cl  [99. 4X] 

Ruls  15:  salinity  >  36.0445 
tsapsratnrs  >  28.9968 
tsapsratnrs  <»  29.9634 
->  class  Cl  [98.6%] 

Rnls  16:  tsapsratnrs  >  29.9634 
->  class  CO  [100. OX] 

Rnls  14:  salinity  <•  36.0445 
tsapsratnrs  >  29.9177 
->  class  CO  [99. 8X] 

Ruls  12:  tsapsratnrs  >  28.2035 
tsapsratnrs  <-  28.9968 
->  class  C2  [100. OX] 

Ruls  10:  salinity  <=  36.318 
tsapsratnrs  >  28.1988 
tsapsratnrs  <~  28.9968 
->  class  C2  [100. OX] 

Rnls  8:  salinity  <«  36.3162 
tsapsratnrs  >  28.1957 
tsapsratnrs  <»  28.9968 
->  class  C2  [100. OX] 

Ruls  6:  salinity  >  36.2057 
salinity  <s  36.2868 
tsapsratnrs  >  28.1669 
->  class  C2  [99. 9X] 

Rnls  4:  salinity  <=  36.2868 
tsapsratnrs  >  28.1838 
tsapsratnrs  <=  28.9968 
->  class  C2  [99. 9X] 

Rnls  2:  salinity  <=  36.1296 
tsapsratnrs  >  28.1669 
tsapsratnrs  <=  28.9968 
->  class  C2  [99. 9X] 

Rnls  7:  salinity  >  36.2868 
tsapsratnrs  <»  28.1957 
->  class  C3  [100. OX] 

Rnls  11:  salinity  >  36.318 
tsapsratnrs  <=  28.2035 
->  class  C3  [100. OX] 

Ruls  1:  tsapsratnrs  <s  28.1669 
->  class  C3  [100. OX] 

Ruls  5:  salinity  >  36.1474 
salinity  <s  36.2057 
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t«qp«ratQr«  <•  28.1988 
->  class  C3  [99. 6X] 
Dslaalt  class:  C3 


Bvalnatioa  on  training  data  (16000  itsas): 


Rule 

Sise 

Error 

Used 

Wrong 

Advantage 

13 

2 

0.6X 

243 

0  (O.OX) 

162  (16210)  Cl 

16 

3 

1.4X 

19 

0  (O.OX) 

19  (1910)  Cl 

16 

1 

o.ox 

2886 

0  (O.OX) 

1607  (160710)  CO 

14 

2 

0.2X 

34 

1  (2.9X) 

33  (3310)  CO 

12 

2 

o.ox 

5227 

0  (O.OX) 

124  (12410)  C2 

10 

3 

O.OX 

22 

0  (O.OX) 

2  (210)  C2 

8 

3 

O.OX 

8 

0  (O.OX) 

3  (310)  C2 

3 

O.lX 

67 

0  (O.OX) 

31  (3110)  C2 

4 

3 

O.IX 

31 

1  (3.2X) 

6  (611)  C2 

2 

3 

O.lX 

29 

0  (O.OX) 

29  (2910)  C2 

7 

2 

O.OX 

4844 

0  (O.OX) 

0  (OlO)  C3 

11 

2 

O.OX 

13 

0  (O.OX) 

0  (OlO)  C3 

1 

1 

O.OX 

1561 

1  (O.lX) 

0  (OlO)  C3 

6 

3 

0.4X 

8 

0  (O.OX) 

0  (OlO)  C3 

Tsstsd  ISOOO,  errors  4  (0.0%)  « 

(a)  (b)  (c)  (d)  <-classifisd  as 


2919  (a):  class  CO 

1  262  (b):  class  Cl 

6383  2  (c):  class  C2 

1  6432  (d) :  class  C3 

7.6  Day  3:  Decision  tree  obtained  for  the  clusters*  description 

Road  15000  cases  (6  attributes)  Iron  ts3.data 
Decision  Tree: 
tenperatnre  29.6616  : 

I  tenperatnre  >  28.9764  :  C2  (168.0) 

I  tenperatnre  <«  28.9764  : 

I  I  tenperatnre  <>  28.8768  :  C3  (11782.0) 

I  I  tenperatnre  >  28.8768  : 

I  I  I  salinity  <-  36.0483  :  C3  (23.0/1.0) 

I  I  I  salinity  >  36.0483  :  C2  (2.0) 

tenperatnre  >  29.6616  : 

I  tenperatnre  <■  30.5747  : 

I  I  tMiperatnre  <«  30.6622  :  CO  (1839.0) 

I  I  tenperatnre  >  30.5622  : 

I  I  I  salinity  <>  36.0993  :  CO  (17.0) 

I  I  I  salinity  >  36.0993  :  Cl  (8. 0/1.0) 

I  tenperatnre  >  30.6747  : 

I  I  tenperatnre  <>  30.6912  : 

I  I  I  salinity  <>  36.0871  :  CO  (11.0) 

I  I  I  salinity  >  36.0871  :  Cl  (16.0) 


I  I  tMp«r«tw«  >  30.6912  : 

I  I  I  t«[ip«ratiir«  >  30.6066  :  Cl  (1109.0) 

I  I  I  tan^cratnrs  <«  30.6066  : 

1)11  salinity  <•  36.0636  :  CO  (2.0) 

I  I  I  I  salinity  >  36.0636  :  Cl  (23.0) 

Btaluation  on  training  data  (16000  itaas): 

Bstors  Pmning  Aftsr  Pruning 

Sins  Errors  Sins  Errors  Estiamta 
23  2(  O.OX)  23  2(  O.OX)  (  O.lX)  « 

7.7  Day  3:  Production  rules  generated  from  the  decision  tree 

Raad  16000  casas  (6  attribntas)  Iron  ts3 
Procassing  traa  0 
Final  mlaa  froa  traa  0; 

Rnla  6:  taaparatnra  >  29.6616 
taaparatnra  <■  30.6622 
->  class  CO  [99. 9X] 

Rnla  8:  salinity  <«  36.0871 
taaparatnra  >  29.6616 
taaparatnra  <-  30.6912 
->  class  CO  [99.9X3 
Rnla  10:  salinity  <*  36.0636 
taaparatnra  >  29.6616 
taaparatnra  <»  30.6066 
->  class  CO  [99.9X3 
Rnla  4:  taaparatnra  >  28.9764 
taaparatnra  <■  29.6616 
->  class  C2  [99.2X3 
Rnla  3:  salinity  >  36.0483 
taaparatnra  >  28.8768 
taaparatnra  <«  29.6616 
->  class  C2  [96.9X3 
Rnla  12:  taaparatnra  >  30.6066 
->  class  Cl  [99.9X3 
Rnla  11:  salinity  >  36.0636 
taaparatnra  >  30.6912 
->  class  Cl  [99.9X3 
Rnla  9:  salinity  >  36.0871 
taaparatnra  >  30.5747 
->  class  Cl  [99.9X3 
Rnla  7;  salinity  >  36.0993 
toqparatnra  >  30.6622 
->  class  Cl  [99.7X3 
Rnla  2:  taaparatnra  <«  28.9764 
->  class  C3  [100.0X3 
Daianlt  class:  CO 

Eralnation  on  training  data  (16000  itaas): 
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Rule 

Sine 

Error 

Used 

Vrong 

Advantage 

S 

2 

O.IX 

1839 

0  (O.OX) 

0  (010) 

CO 

8 

3 

O.lX 

27 

0  (O.OX) 

0  (010) 

CO 

10 

3 

O.IX 

2 

0  (O.OX) 

0  (010) 

CO 

4 

2 

0.8X 

168 

0  (O.OX) 

126  (12610)  C2 

3 

3 

3.  IX 

2 

0  (O.OX) 

2  (210) 

C2 

12 

1 

O.IX 

1109 

0  (O.OX) 

53  (5310) 

Cl 

11 

2 

O.IX 

23 

0  (O.OX) 

4  (410) 

Cl 

9 

2 

O.IX 

16 

0  (O.OX) 

2  (2t0) 

Cl 

7 

2 

0.3X 

8 

1  (12. 5X) 

6  (711) 

Cl 

2 

1 

O.OX  11805 

1  (O.OX) 

11804  (1180410)  C3 

Tasted  16000,  errors  2  (O.OX)  « 

(a) 

(b) 

(e)  (d)  <-elassified  as 

1869 

1 

(a): 

class  CO 

1155 

(b): 

class  Cl 

170 

1  (c): 

class  C2 

11804  (d): 

class  C3 
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